Correcting English Text Using PPM Models
Abstract
An essential component of many applications in natural language processing is a language modeler able to correct errors in the text being processed. For optical character recognition (OCR), poor scanning quality or extraneous pixels in the image may cause one or more characters to be mis-recognized, while for spelling correction, two characters may be transposed, or a character may be inadvertently inserted or missed out. This paper describes a method for correcting English text using a PPM model. A method that segments words in English text is introduced and is shown to be a significant improvement over previously used methods. A similar technique is also applied as a post-processing stage after pages have been recognized by a state-of-the-art commercial OCR system. We show that the accuracy of the OCR system can be increased from 96.3% to 96.9%, a reduction of about 14 errors per page.

1 Motivation

In order to fully evaluate the performance of text compression algorithms, large bodies of material have to be made available in machine-readable form. Optical character recognition (OCR) systems provide a fast and relatively inexpensive option for acquiring such text. However, when confronted with digitizing over 3,000 pages of Dumas Malone's Jefferson and His Time (Malone, 1977), we found that the time taken to correct the text once it had been processed by an OCR system was prohibitive. Some common mistakes made by the OCR software were the letter c being confused with the letter e (for example, the word thc occurring instead of the, and Jcffcrson instead of Jefferson), problems with the letters m and w, l's being replaced by i's, the mis-recognition of question as guest ion, and upper-case characters appearing in the middle of words. The task of correcting these errors added substantially to the time required to complete the project. (On average, a further three minutes was required to correct each page.) This paper describes a method of correcting these errors. The next three sections describe the theoretical background to the approach we adopt. After that, we describe how the method can be applied to two specific problems: automatic segmentation of words in text, and improving OCR output.
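To make the segmentation problem concrete, here is a minimal sketch of language-model-driven word segmentation. The paper's own system drives the search with a PPM character model; the sketch below substitutes a simple smoothed word-unigram model, and the function names and toy training corpus are illustrative assumptions rather than the authors' code.

import math
from collections import Counter

def train_unigram(corpus_words):
    counts = Counter(corpus_words)
    total = sum(counts.values())
    def logprob(w):
        if w in counts:
            return math.log(counts[w] / total)
        # Unseen words are penalised per character, so that one long unknown
        # chunk cannot out-score several known words (a common heuristic).
        return math.log(10.0 / (total * 10 ** len(w)))
    return logprob

def segment(text, logprob, max_word_len=20):
    # best[i] holds the best log-probability of any segmentation of text[:i];
    # back[i] holds the start index of the final word in that segmentation.
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            score = best[j] + logprob(text[j:i])
            if score > best[i]:
                best[i], back[i] = score, j
    # Recover the word boundaries by walking the backpointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

logprob = train_unigram("the cat sat on the mat the dog ran".split())
print(segment("thecatsatonthemat", logprob))
# -> ['the', 'cat', 'sat', 'on', 'the', 'mat']

The same dynamic-programming skeleton applies when the unigram scores are replaced by character-level PPM probabilities, which is closer to what the paper evaluates.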
Similar Papers
The Entropy of English Using PPM-Based Models - Data Compression Conference, 1996. DCC '96. Proceedings
Over 45 years ago Claude E. Shannon estimated the entropy of English to be about 1 bit per character [16]. He did this by having human subjects guess samples of text, letter by letter. From the number of guesses made by each subject he estimated upper and lower bounds of 1.3 and 0.6 bits per character (bpc) for the entropy of English. Shannon’s methodology was not improved upon until 1978 when ...
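As a hedged sketch of the bounds behind those figures (using the notation of Shannon's 1951 formulation, where $q_i$ is the relative frequency with which subjects guess the correct letter on the $i$-th attempt over a 27-character alphabet, and $q_{28} = 0$):

\[
\sum_{i=1}^{27} i\,(q_i - q_{i+1}) \log_2 i \;\le\; H \;\le\; -\sum_{i=1}^{27} q_i \log_2 q_i
\]

The upper bound is the first-order entropy of the sequence of guess numbers; the lower bound follows from Shannon's argument about the least-informative letter distribution consistent with the observed $q_i$.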
Adaptive models of Arabic text
The main aim of this thesis is to build adaptive language models of Arabic text that achieve better compression performance than existing models. Prediction by partial matching (PPM) language models have been the best-performing adaptive language models of the past three decades in terms of compression performance. In order to get such performance for Arabic text, the ri...
Bilingual Random Walk Models for Automated Grammar Correction of ESL Author-Produced Text
We present a novel noisy channel model for correcting text produced by English as a second language (ESL) authors. We model the English word choices made by ESL authors as a random walk across an undirected bipartite dictionary graph composed of edges between English words and associated words in an author's native language. We present two such models, using cascades of weighted finite-state tra...
Text Classification Using Word-Based PPM Models
Text classification is one of the most important natural language processing problems. In this paper, the application of a word-based PPM (Prediction by Partial Matching) model to automatic content-based text classification is described. Our main idea is that words, and especially word combinations, are more relevant features for many text classification tasks. Key-words for a document in mo...
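As a rough illustration of the compression-style classification this snippet describes: train one model per class, then assign a new document to the class whose model gives it the lowest cross-entropy. True word-based PPM blends several context orders with escape probabilities; the stand-in below uses an add-one-smoothed word-unigram model, and the class labels and toy training data are invented for illustration.

import math
from collections import Counter

class WordModel:
    def __init__(self, docs):
        self.counts = Counter(w for d in docs for w in d.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1
    def cross_entropy(self, doc):
        # Average bits per word under this class's model, add-one smoothed.
        words = doc.split()
        return -sum(math.log2((self.counts[w] + 1) / (self.total + self.vocab))
                    for w in words) / max(len(words), 1)

def classify(doc, models):
    # Pick the class whose model "compresses" the document best.
    return min(models, key=lambda label: models[label].cross_entropy(doc))

models = {
    "sports": WordModel(["the team won the match", "a great goal in the game"]),
    "tech":   WordModel(["the new compiler release", "a faster model on the gpu"]),
}
print(classify("the team scored a goal", models))  # -> "sports"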
Correcting Spelling Errors by Modelling Their Causes
This paper presents a new technique for correcting isolated words in typed texts. A language-dependent set of string substitutions reflects the surface forms of errors that result from vocabulary incompetence, misspellings, or mistypings. Candidate corrections are formed by applying the substitutions to text words absent from the computer lexicon. A minimal acyclic deterministic finite automa...
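A minimal sketch of the candidate-generation step this snippet describes: apply a set of string substitutions to a word missing from the lexicon and keep the variants the lexicon accepts. The substitution table, lexicon, and function name below are illustrative assumptions (the paper itself stores the lexicon in a minimal acyclic deterministic finite automaton, which a plain set stands in for here).

# Each pair (src, dst) encodes one surface error pattern, e.g. c/e confusion.
SUBSTITUTIONS = [("c", "e"), ("e", "c"), ("i", "l"), ("l", "i"), ("rn", "m")]
LEXICON = {"the", "them", "modern", "little"}

def candidates(word):
    # Apply every substitution at every position it occurs; keep variants
    # that the lexicon accepts as candidate corrections.
    found = set()
    for src, dst in SUBSTITUTIONS:
        start = word.find(src)
        while start != -1:
            variant = word[:start] + dst + word[start + len(src):]
            if variant in LEXICON:
                found.add(variant)
            start = word.find(src, start + 1)
    return found

print(candidates("thc"))     # {'the'}    -- c/e confusion
print(candidates("modcrn"))  # {'modern'}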